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Abstract 

We solve the ghost-gluon system of Yang-Mills theory using Graphics Processing Units 
(GPUs). Working in Landau gauge, we use the Dyson-Schwinger formalism for the math- 
ematical description as this approach is well-suited to directly benefit from the computing 
power of the GPUs. With the help of a Chebyshev expansion for the dressing functions and 
a subsequent appliance of a Newton-Raphson method, the non-linear system of coupled 
integral equations is linearized. The resulting Newton matrix is generated in parallel using 
OpenMPI and CUDA"""^. Our results show, that it is possible to cut down the run time 
by two orders of magnitude as compared to a sequential version of the code. This makes 
the proposed techniques well-suited for Dyson-Schwinger calculations on more complicated 
systems where the Yang-Mills sector of QCD serves as a starting point. In addition, the 
computation of Schwinger functions using GPU devices is studied. 

Keywords: Ghost-Gluon System, Yang-Mills, Dyson-Schwinger equations, parallel 
computing, CUDA^^programming model, Graphics Processing Units 



1. Introduction 

It is well-established by now that quantum chromodynamics (QCD) provides the necessary 
framework to describe the strong interaction among quarks and gluons. It is furthermore 
believed that confinement, which denotes the absence of color charged objects from the 
physical spectrum, origins from the gauge sector of the theory. Here, the infrared prop- 
erties of the one-particle irreducible Green's functions are of particular interest. Due to 
the large value of the coupling in this low-energy regime, a non-perturbative treatment is 
mandatory. The Dyson-Schwinger (or generally a Green's functions) approach provides a 
continuum formulation of QCD [IH5] capable to describe the system over the entire mo- 
mentum range. Dyson-Schwinger equations (DSEs) constitute a highly coupled system of 
non-linear integral equations, the equations of motion for the underlying quantum field 
theory. If there were the possibility to solve these equations self-consistently, the whole 
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dynamics of the quantum system would be uncovered [6]. Unfortunately, each DSE com- 
prehends higher order Green's functions, such that the whole system builds up to an infinite 
tower of n-point functions and an appropriate truncation is mandatory. The truncation 
procedure is a highly non-trivial task and one has to account for errors induced by the 
particular truncation scheme. On the other hand, and based on the work of [3-[9], there 
has also been substantial progress in solving the whole tower in the far infrared [T0HT2] . 

Initiated by Mandelstam [13] the gluon propagator and later the ghost-gluon system was 
at the focus of many contemporary DSE studies. Recent investigations agree on an en- 
hancement of the ghost whereas the gluon propagator is suppressed in the infrared regime 
[HHin]. This picture is also enforced by results obtained from lattice simulations, see, 
e.g., [201 m] cind references therein, or Functional Renormalization Group methods [22] . 
Whether or not the gluon propagator exactly vanishes in the infrared has been discussed 
for quite some time Although this so-called scaling solution is predicted by continuum 
approaches it is, in four space-time dimensions, not seen in recent lattice calculations of 
the gluon propagator [21], see however [271 - I29] . 

Within the last years, Graphics Processing Units (GPUs) became an essential branch in 
high performance computing. Their efficiency, i.e. the ratio between computational power 
and power consumption, makes them a reasonable alternative to conventional clusters. 
With the availability of low-cost consumer GPU devices this technology also finds its way 
into desktop computers offering the possibility to perform general purpose scientific and 
engineering computations in an until then not feasible way. However, the mapping of ex- 
isting algorithms and/or software to the massively parallel SIMD architecture of the GPU 
is often difficult such that a restructuring of the algorithms/code is inevitable in order to 
meet the requirements of the hardware. In case of the ghost-gluon system only minimal 
modifications are needed such that the portation of the sequential code is a straightforward 
task. The main objective of this paper is to employ the benefits of modern GPU devices 
into DSE calculations. Here, Yang-Mills theory is not only an interesting topic on its own 
but serves also as a starting point for investigations involving fermions since the treatment 
of larger systems by incorporating additional DSEs is possible with the proposed methods. 
Compared to the sequential code, performance gains by two orders of magnitude can be 
obtained. This paper is organized as follows. In Section |2] we introduce the mathematical 
formulation of the problem. In Section [3| a description of the employed numerical methods 
will be given, where the parallelization will be detailed in Section |4| Finally in Section 
|5]we compare the performance of the sequential code with the parallelized versions using 
CUDA^'^as well as OpenMPI and, in addition, the computation of Schwinger functions on 
GPU devices will be outlined. Our summary and conclusions will be given in Section [6j 



^This would correspond to an infrared singular ghost propagator, i.e., a scenario in accordance with 
the Kugo-Ojima ^23^ i24, and Gribov-Zwanziger [53 confinement criterion. 
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2. The Dyson-Schwinger Equations for the Ghost-Gluon System 



Within the DSE approach to QCD the Yang-Mills system is described by a set of coupled 
integral equations for the corresponding ghost and gluon propagator as depicted in Figs. 
l]j2| where throughout this paper Landau gauge is used. Note that the DSE for the gluon 
propagator is already truncated in order to render the system tractabl^ Thus, no two- 
loop as well as no tadpole diagrams are taken into account. The black dots indicate dressed 
propagators whereas the blue blobs represent dressed vertices. 




Figure 1: The DSE for the Landau gauge ghost propagator. 



-1 ' ' irS ^ 



Figure 2: The truncated DSE for the Landau gauge gluon propagator. 
The corresponding formal expression for the ghost propagator reads 

^ r,,o{p,q)D^.{p - q)Mq,p)DG{q), (1) 
where for the gluon propagator it is given by 

^ r,,o{p,q)DG{p-q)r.{q,p)DG{q) 
- Zig^Y J -^^^t^P^fi{P^q)Dpp'{p- q)Tp'ya'{q^p)D„^'{q). 



(2) 



Here, D /T denotes the dressed and bare propagators/vertices respectively. N^. is the num- 
ber of colors and Zi is the renormalization constant for the ghost-gluon vertex which can be 
set to one in Landau gauge as this vertex is UV finite [30]. Furthermore, Zi is the renor- 
malization constant for the three-gluon vertex and Z-^ and Z3 denote the wave-function 
renormalization constants for the ghost and gluon, respectively. For the purpose of this 



^Throughout this section the notation is adapted from PjB] where this system has been solved. 
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numerical study it is sufficient to replace the dressed gliost-gluon vertex by its bare coun- 
terpart, i.e. T^{q,p) = iq^, whereas for the three-gluon vertex the model proposed in [16] 
is employed 

with k"^ = {q — pY- The choice of a = 6 = 35, where 5 = —9/44 is the anomalous dimension 
of the ghost, leads to the correct scaling of the dressing functions in the ultraviolet regionj^ 
Using 0(4) invariance in the Euclidean formulation of QCD, the four dimensional integrals 
can be rewritten in the form 

dS = ^n [ dy^ [ dOsm'^e, (4) 



2 _„ 

where two integrations are carried out trivially yielding a factor of Att. Here we introduced 
the abbreviation = y for the loop momentum|^ In the following we furthermore use 
= X and fc^ = z. For the radial integral we employ a standard Gauss-Leg endre quadrature 
rule, where the nodes are appropriately distributed over the whole momentum range by 
using a non-linear mapping function. For the angular integration we use a tanh-sinh 
quadrature fST] as the integral can be rewritten as 

d9 sinH^ J d^^/l-e, (5) 

with ^ = cos^. This quadrature rule is well-suited for the occurring integrands, showing 
singular behavior at the boundaries of the integration region. Compared to a Gauss- 
Chebyshev quadrature we need much less integration points to get the same - if not better 
- results for the angular integration. With the following ansatz for the ghost propagator 

Doip) = (6) 

as well as for the gluon propagator 

D,M - (V - '-f) ^. (T) 

the corresponding DSEs can be rewritten 

Gix)-' = Zs-ff% [\yy G{y) f di (1 - if' ^ , (8) 

[^T^r Jo J -I ^ 



z{y)z{z)f 



+ §^#3^/0 J ^d^l-ey^'M{x,y,z)G{y)G{z), (9) 



■^For small momenta, the model approaches a constant value, which is reasonable as the gluon loop 
diagram is sub-leading the infrared regime. 

''We regularized the system using a sharp momentum cutoff A^. 
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where the integral kernels reac^ 

M{x,y,z) = 
Qix,y,z) = 
+ 

+ z 
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To introduce a convenient notation the above equations are rewritten 



G{x)-^ 



Z3 + Hzix) , 



and a MOM scheme is applied in the next step 

G{x)-' = g{xg)-' + ndx) - ncixc) 
z{x)-^ = z{xzy^ + nz{x)-nz{xz). 



(12) 
(13) 



(14) 
(15) 



In this renormalization scheme we subtract the equations at some squared momenta xq 
and x^, treating G{xg)~^ and Z{xz)~^ as new parameters to fix the system. Furthermore, 
in the limits of very small/large external momenta x the system can be solved analytically, 
where in the infrared region one findfl 



Z{x < 1) 
G(x < 1) 



Ax2^ 



(16) 
(17) 



with K 0.595353 and some general coefficients A and B. These two coefficients are 
fixed by demanding that the analytical solution must coincide with the numerical solution 
at a specific matching point in the infrared regime. Similar to this, one can apply a 
logarithmic ansatz for the dressing functions in the ultraviolet region 



G{x > 1) 
Zix > 11 



Guv 
Zuv 



win 
oj In 



X 



Xuv 
X 

Xuv 



+ 1 
+ 1 



(19) 



^In the last line an additional term 15/4 is added in order to cancel spurious quadratic divergencies in- 
troduced by the truncation. These divergencies would be absent in a full treatment. A pragmatic approach 
is to remove them by hand by adding an additional term. Furthermore, in the numerical implementation 
we optimized the kernels to avoid unnecessary division operations. 

^See Ref. [TS] for a detailed analysis of the IR/UV behavior of the dressing functions. 
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The analytical treatment yields u = llNca{fi'^)/127i, where Q;(/i^) = g^/4:Tr is the coupling 
at the renormalization scale /z. The constants Guv cind Zjjy are fixed by matching the 
numerical solutions at xuv = cind 7 = —13/22 is the anomalous dimension of the gluon. 
To allow to put in the following sections the focus on the numerical methods only we 
will present the numerical results for the dressing functions, the running coupling and the 
Schwinger functions already here. 

2.1. Dressing Functions and Running Coupling 

With an appropriate choice of G{xg) and Z{xz) the system is fixed, where we use xg = 
and = = 5 X 10^ GeV"^ in our calculations. In Fig. [s] the ghost and gluon dressing 
function is plotted, where Z{xz) = 0.256 and = 1 is usecQ 




-6 -4 -2 2 4 

10 10 10 10 10 10 

X = [GeV^] 

Figure 3: The scaling solution (solid line) as well as two decoupling solutions (dashed lines) for the ghost 
and gluon propagator on a log — log plot. For the latter case, we used G{xg)^^ = 0.01 (dashed line) and 
G'(xg)"^ = 0.1 (dashed-dotted line). 

The choice of G{xg)^^ = leads to an infrared singular ghost dressing function, whereas 
the corresponding gluon dressing function vanishes in this regime. This is the so-called 
scaling solution. A non- vanishing value of G{xg)^^ > yields a decoupling solution which 



''Note that these values are arbitrary in principle. Although, together with our choice for a(/i^) they 
yield the correct experimental value a{M'^) = 0.118 for the running coupling, where Mz — 91.2 GeV is 
the Z-boson mass. The renormalization scale /i is implicitly fixed by specifying a value for a(/z^), where 
the renormalization condition G'^(/i^)Z(//^) = 1 is used. 
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is indicated by the dashed/dashed-dotted lines. In Fig. |4] the non-perturbative running 
couphng a{x) = a{fi^)G'^{x)Z{x) is plotted. 




Figure 4: The non-perturbative running coupling a{x) on a log — linear plot. The scaling solution (solid 
line) is compared to the decoupling solution, where in the latter case G{xg)^^ = 0.01 (dashed line) and 
G{xg)^^ —0.1 (dashed-dotted line) is used. 

The infrared fixed point is a(0) ~ 8.915/Nc = 2.972 in the scaling case. In the decou- 
pling case the running coupling vanishes in the infrared region which agrees with DSE 
calculations on a compact manifold |32] and lattice calculations |33j . 

2.2. Schwinger Functions 

Schwinger functions are a helpful tool when exploring the analytic structure of propa- 
gators. Although a detailed description is beyond the scope of this papeij^ we want to 
note the following: Within the Euclidean formulation of quantum field theory, negative 
norm contributions to a specific propagator correspond to a violation of the Osterwalder- 
Schrader axiom of reflection positivity [35]. Accordingly, this propagator does not possess 
a Kallen-Lehman spectral representation and cannot describe a physical asymptotic state. 
Positivity violation on the level of propagators can be tested with the help of Schwinger 
functions defined as 

A(t):=- / dHcos(t|p|)o-(/)>0, (20) 

Jo 



A concise treatment can be, e.g., found in refs. [51 [M]. 
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where is a scalar function extracted from the according propagator which in case of 

the gluon is given by = Z{p'^)/p'^ [36]. In Fig. [sjwe show the Schwinger function for 

the gluon propagator obtained from our numerical treatment of the ghost-gluon system. 




t [GeV"^] 

Figure 5: Displayed is the absolute value of the gluon propagator's Schwinger function on a log — linear 
plot. In accordance with [36] a zero slightly below t — 5GeV^^ is observed which indicates the violation 
of positivity at a scale of roughly 1 fm. 



3. Outline of the Numerical Method 

The renormalized system of coupled integral equations reads 

G{x)-' = G{xG)-' + nG{x)-nG{xG), (21) 
Z{x)-' = Z{xz)-' + nzix)-nzixz), (22) 

where TIg^x) and Tlzi^x) denote the ghost and gluon self-energy terms, respectively. We 
now employ a Chebyshev expansion for the logarithms of the dressing functions 

l^G'(x) = I + $^&,T,(.(x)), (23) 
i=i 

N-1 

^^Z{x) = | + $^a,T,(.(x)), (24) 
i=i 

where s is a suitable mapping function, which maps the [—1, 1] interval of the Chebyshev 
polynomials to the interval [e^, A^]. We use the one proposed in [l^ which reads 

= (25) 
logio(A/e) 
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The values for the infrared matching point and the ultraviolet cutoff used in our calculations 
are given by = 10~^ GeV^ and = 5 x 10^ GeV^, respectively. Expanding the logarithm 
of the dressing functions leads to better results in the UV regime and to a significant 
reduction of Chebyshev polynomials needed in the expansion ^7\. Furthermore, we split 
the radial integration according to 

px pIs? 

^ + / + / • (26) 



-'0 Je^ Jx 



In our numerical treatment the dressing functions G{x) and Z{x) are evaluated within the 
range x G [e^, A^]. However, due to the appearance of the argument z = x+y—2y/xy cos 9 e 
[0, 4A^] extrapolated values are required. Thus, depending on the region, a function returns 
the appropriate values for G{z) and Z{z) by taking either the numerical or the respective 
analytical solution. We now get a non-linear system for the Chebyshev coefficients, such 
that the following equations have to be fulfilled for all external momenta x 

g{x- a, b) = G{x)-^ - - ndx) + ncixG) = 0, 

zix;a,h) = Z{x)-^~Z{xz)-'-nz{x) + nz{xz) = 0. ^ ^ 



Here, a vector notation for the Chebyshev coefficients is used 



a = (ao, ai, . . . , a^-i)'^ 
b = {bo,bi, . . .,hN-if. 



{21 



In order to treat the unknown 2N Chebyshev coefficients, we evaluate the two equations 
at external momentaj^Xj 

g{xi;a,h) = Gix,)-' - G{xg)~' - Haixi) + ndxa) = 0, (29) 
z(xi;a,b) = Z{x,)-^-Z{xz)-^-nzixi) + nzixz) = 0. (30) 



According to [M] , a Newton- Raphson method is subsequently used to linearize the system. 
It maps the non-linear system of equations for the Chebyshev coefficients to a linear system 
for the so-called Newton improvements which makes a matrix representation possible. 
During each iteration step new improvements are generated by a derivative of the functions 
z and g with respect to the 2N Chebyshev coefficients. These Newton improvements are 
then subtracted from the old coefficients in order to create a new set which is closer to the 
real solution, where for the initialization a starting guess is required. The two equations 
for the coefficients and their improvements reads 



^] - gAa]+\ (31) 
b]+' = b]-QAbf\ (32) 



In our calculations we use the mapped roots of the Chebyshev polynomials. 
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where n is the iteration step. Here, the last terms are additionally decorated with an 
under-relaxation parameter q. One advantage of the Newton method is that if the system 
is close enough to the real solution, the method converges quadratically. This of course 
depends on the starting guess. The under-relaxation parameter is used to relax the system 
close to the real solution also from unlucky choices of the initial conditions. After some 
iteration steps, this parameter can be set back to one in order to benefit from the quadratic 
convergence rate of the method. The Newton improvements Aa and A6 are described by 
a X 2N set of linear equations 



dttj ^ dbj 



0, 



dbj 



(33) 

(34) 



This linear system is represented by the following matrix acting on a vector for the A's 



dz{xo) 


dz{xo) 


dz{xo) 


dz{xo) 


dao 




dbo ■ 


dbN-i 


dz{xN-i) 


dz{xN-i) 


dz{xN-i) 


dz{xN-i 


dao 


daN-1 


dbo ■ 


dbN-i 


dg{xo) 


dg{xo) 


dg{xo) 


dg{xo) 




daN-1 


dbo ' 


dbN-i 


dg{xN-i) 


dg{xN-i) 


dg{xN-i) 


dg{xN-i 


dao 


daN-1 


dbo ■ 


dbN-i 



( Aao \ 

Aoat-i 
A6o 

VAfeiv-J 



/ ^K) \ 

z{xn-\) 

g{xo) 
\gixN-i)J 








(35) 

After the system iterated several times and convergence is achieved, the solutions have 
to vanish. Thus one needs to generate and invert the Newton matrix in order to get a 
new set of Newton improvements during each iteration step. A detailed description on the 
implementation of the numerical method outlined above will be given after some general 
remarks on GPU programming. 



4. GPU Calculations using CUDA™ 

During the last years, programmable Graphic Processor Units (GPUs) became more and 
more important in scientific and engineering high performance computing. Due to their 
multi-threaded manycore processor architecture, they are perfectly well-suited in dealing 
with compute-intensive, parallel programs. There are several programming models avail- 
able like OpenCL™or DirectCompute^'^'^for instance. We use CUDA^'^'^for the numerical 
implementation of our problem, the parallel computing model provided by NVIDIA®. 
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4.1. Kernels, Blocks and Threads 

A CUDA^^code runs on the host CPU as a serial program in which the compute intensive 
parts are condensed into one our more kernel functions to be performed in parallel on the 
GPU device. In order to be scalable, the kernels are structured into a three-level hierarchy 
of threads. The top level is represented by a grid of thread-blocks with up to 1024 threads 
acting within one block. Thread-blocks represent the second hierarchy level and contain the 
same number of threads in each case. A single thread stands at the lowest hierarchy level 
and represents the basic building block of a kernel function. The distribution of thread- 
blocks to the different device multiprocessors is managed by the hardware and there is no 
possibility to decide which block is performed on which multiprocessor at a given time. 
For that reason, blocks have to be independent from each other. 

A single multiprocessor which operates on a specific block executes several threads in 
parallel, grouped to a so-called warp. The individual threads in a warp have the same 
start address in the program but can act on/with different data and are in principle free 
to branch and follow different execution paths. In this case each branch path is executed 
sequentially (warp divergence), where at the end the threads converge back to the original 
execution path and the multiprocessor can handle the next warp. Warp divergence has a 
considerable effect on the run time of the program and best performance is achieved if all 
threads of a warp agree on their particular execution path. 

Each multiprocessor owns a specific number of registers. These memory spaces are dy- 
namically assigned to the corresponding threads currently running on the multiprocessor. 
Whereas each thread can operate on its specific register space, there is also the possibility 
that threads within a block can interchange data via a very fast shared memory. Com- 
munication between different blocks in a grid can only occur via a global memory space 
which is slower than the shared memory. The global memory can be accessed by all the 
threads within the kernels as well as by the host to read/write the corresponding data. 
Although global memory access is still roughly ten times faster than the main memory 
access of a conventional CPU, it is favorable to minimize the communication of threads 
with the global memory during a kernel call. If the memory access is not performed in a 
coalesce way [38| 139] it can be that some threads already operate on their specific data 
while others have to wait for data arriving from the global memory. 

In general the optimization of a CUDA^^program benefits from several tasks. First the 
communication between the host and the device as well as out-of-order accesses to the 
global memory from kernels have to be minimized. Furthermore, the overall performance 
of the program also depends on the number of blocks and the blocksize. Running only 
one block per multiprocessor can force the specific processor to idle because of latencies in 
memory access and/or block synchronization. Launching several blocks decreases register 
and shared memory resources available to a single thread-block. The blocksize strongly 
depends on the workload each thread has to perform and on the number of registers 
available per block. It can very well happen that the possible number of threads is below 
the maximal number of threads given in the device specifications simply because of the 
heavy register usage of compute intensive threads. 
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4-2. Implementation on a Graphics Device with CUDA'^^ 

The central task of the code is to generate the Jacobian matrix for the Newton improve- 
ments. Therefore we spht the matrix into sub-matrices which are distributed onto four 
different kernel functions 



J 



( dz{x^) 
dcio 

dz{xN-i 
dao 

dg{xo) 



dz{xo) dz{xo) 
da 



N-l 



dbn 



dz{xo) \ 
db 



Kernel 1 



Kernel 2 



dz{xN^i) dz{xN-i) 



dao 

dgjxN- 
\ dao 



daN-i 

dgjxo) 
daN-i 



dbo 

dgjxo) 
dbo 



dz{xN-i 
dbN^i 

dg{xo) 



db 



N-l 



Kernel 3 



Kernel 4 



dg{xN-^i) dg{xN-\) 



dajy- 



dbn 



dgjxN^ 
db 



(36) 



N-l 
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These kernels can now be launched on the available GPU devices. Once the matrix elements 
are derived, a data transfer to the host takes place where the inversion of the matrix is 
performed. As the size of the matrix is rather small, the inversion can be done with 
standard LU decomposition routines on the host. The time usage of this procedure is 



negligible compared to the generation of the matrix, 
matrix is split into blocks of size 

/ dz{xo) dz{xo) 



dao 
dz{xi) 

dao 



dai 
dz{xi) 

dai 



GPUl 



dz{xN-i) dz{xN-i) 



\ dao 
t 

Block 1 



dai 
t 

Block 2 



On the particular device the sub- 



dz{xo) \ 

daN-i 
dz{xi) 

daN-i 

dz{xN-i] 
da^-i / 
t 

Block N 



(37) 



These blocks are completely independent, i.e., no data transfer or communication with the 
host or between the threads in the block is needed. In addition, the derivatives can be 
performed by two threads 

dz{xi) z{xi; a\a^+, h) - z{xi; a\a^_^ h) 
= 27, ' 

where finally one of the two threads has to sum up and divide the result by 2ej. In our 
numerical simulation we use ej ~ 10~^aj as the derivatives are symmetric. Note that in 
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each block the same derivatives are used such that the corresponding vectors can be pre- 
calculated in parallel during each iteration step and stored as an array which is, in case of 



the matrix part treated in Eq. (38), of the form 



fao + eo ao 
ai ai + ei 



ai ai — e\ 



ao 
ai 



ao 
ai 



(39) 



These arrays are generated on the GPU devices according to their assigned sub-matrix. 
Subsequently, the blocks can load their specific derivative vector into the shared memory 
with respect to their block and thread IDs. Our code is separated into an initialization 
part running on the host and a main part which is performing the generation of the 
Newton matrix. The first part deals with the memory management, the initialization 
of the Chebyshev coefficients as well as the transfer of the corresponding data to the 
different devices. Note that the weights and nodes for the quadratures, which are generated 
on the host, are stored within the constant memory of the devices. The main part is 
split up into four kernel functions for the four different domains of the Newton matrix as 
well as one kernel function for the generation of the solutions vector. These kernels are 
launched in parallel on the specific devices using the cudaSetDevice instruction. In our 
numerical implementation using OpenMPI each process invokes a GPU device according 
to its specific process ID After the kernels generated their particular part of the matrix 
a data transfer to the host is performed where the LU decomposition/substitution takes 
place. The generation of the derivatives is performed in parallel by an additional kernel in 
the beginning of each iteration step. In the following we show performance results of the 
serial code and compare with a parallelized version using OpenMPI as well as CUDA"'"'^. 



^'^We note that the usage of a separate GPU device for the generation of the solutions vector is less 
efficient. As this step requires an additional compute node, the benefits of the additional resources are 
most probably spoiled by the expensive communication paths between the two machines as there are only 
four GPU devices per machine. Here, the overall performance of the code is improved if one of the four 
MPI processes, in our case the master process, performs the calculation of the solutions vector as well 
as one ghost part of the matrix. The master process collects the data from the other MPI processes via 
a MPLGather operation such that in this setting no data transfer is needed. We note that using this 
configuration also three GPUs would be enough, as the run time of the two processes performing the 
ghost parts is comparable to the run time of a single process performing one gluon part of the matrix. 
Nevertheless, the following performance tests were carried out on four GPUs. 
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5. Performance Results 



We tested various versions of the numerical method outlined in Sec. [3j The easiest approach 
is to perform the calculations on a single CPU devic^^ Subsequently, we parallelized the 
code using OpenMPI which is, due to the nature of the problem, a straightforward task. 
Each element of a Newton sub-matrix is generated in blocks according to the number of 
processes launched, starting from the top left entry. Here, only the master process has to 
allocate memory for the whole matrix since the other processes are calculating just the 
elements. Since the communication load is negligible, this procedure scales quite well with 
the number of processors available. The final step is the GPU implementation, where we 
additionally compare a single and a multi GPU versionp^ 

For the numerical calculations the following hardware is used: 

• Intel Core2 Quad Processor Q9400 @ 2.66GHz (no Hyper-Threading capabilities), 

• Intel Xeon Six-Core X5650 @ 2.67GHz, 

• NVIDIA Geforce GTX 560 Ti 448 Cores, 

• NVIDIA Tesla C2070. 

The Xeons are additionally connected via 40Gb/s QDR InfiniBand, whereas for the Core2s 
a standard Gigabit-Ethernet LAN is used. The following table shows performance results 
obtained on the specific hardware using the different numerical implementations, where 
the value in brackets denotes the number of cores/devices used for the corresponding 
calculation. By varying the number of radial integration nodes and Chebyshev polynomials, 
six different work-loads are employed. 



Nodes 


Q(i) 


Q(12) 


Xe(l) 


Xe(12) 


GTX(l) 


Tesla(l) 


Tesla(4) 


Speed-up 


32 


1.63 


0.14 


3.75 


0.42 


0.08 


0.07 


0.04 


94/11 


64 


3.27 


0.29 


7.34 


0.86 


0.13 


0.11 


0.06 


122/14 


64 


14.80 


1.30 


23.45 


2.05 


0.27 


0.23 


0.10 


235/21 


128 


29.58 


2.58 


48.38 


4.25 


0.53 


0.45 


0.15 


323/28 


64 


54.32 


4.70 


74.22 


6.40 


0.58 


0.40 


0.15 


495/43 


128 


119.33 


9.39 


152.05 


13.08 


1.15 


0.75 


0.27 


563/48 



Table 1: The run time of the code in minutes using 36/60/96 Chebyshev polynomials (upper/middle/lower 
part of the table) for each dressing function. For the radial integral 32/64/128 Gauss-Legendre nodes are 
employed within each of the three integration intervals, cf. eq. (|26[). 



^^The numerical problem discussed here can be solved quite fast on a single CPU, see table 1. The 
current (described) calculations provided in table 1 serve, however, as an important test case because in 
the near future much more involved truncation schemes of DSEs will be investigated. 

^^For an accurate performance test we use rather large workloads and we furthermore employ the fit 
functions proposed in |16j to initiate the system setting the under-relaxation parameter g — 1 in this 
case. Using this setup the system iterates four times, where we implemented a cross sum check over all 
Chebyshev coefficients with an epsilon of e = 10~^. 
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The upper /middle/lower part of Tab. [T] represents the performance test using 36/60/96 
Chebyshev polynomials for each dressing function, where the entries denote the run time 
of the code in minutes. Here, the approximate speed-ups are measured between the slowest 
and the fastest implementation on the one hand (first number) and within a more equitable 
setup by comparing twelve Xeons with four Teslas (second number). With increasing work- 
load, the GPUs perform successively better and we would like to emphasize that already 
with the relatively low-priced consumer card impressive speed-ups can be obtained. In Fig. 
6] we show the scaling of the MPI code when the number of CPU cores is increased (for 
both the Core2s and the Xeons). Here 128 radial integration points are used within each 
integration region as well as 48 Chebyshev polynomials for each dressing function. 




B 

H 



■- -■ Xeon (Hyper-Threading) 




■— ■ Xeon 




A- -▲ Core2 (no Hyper-Threading) 




■A— A Core2 





10 15 
Number of CPU Cores 



Figure 6: The expected linear performance gain of the MPI code when the number of CPU cores is increased 
(sohd hues). Whereas in a direct comparison of the two architectures the Xeons are slower, Intel's Hyper- 
Threading technology increases the efficiency of the server CPUs by approximately 40% (blue dashed line). 
Here, 16/24 threads are launched on twelve CPU cores. The Core2s show a considerable performance break 
down in this case since Hyper- Threading is not available on these devices (red dashed line). 



Both measurements show the expected linear scaling behavior (solid lines). Launching more 
processes than CPU cores available (dashed lines) results in a serious performance drop of 
the Core2s. Due to Intel's Hyper- Threading technology, a single Xeon core can operate on 
two threads concurrently which results in a performance loss of only 20% compared to a 
test run using the corresponding number of real cores. In Fig. [7] the blocksize dependence 
of the CUDA"'"^code for the two different types of GPUs is shown. 
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•-• NVIDIA Tesla C2070 

■-■ NVIDIA Geforce GTX 560 Ti 448 Cores 



32/576 64/288 96/192 128/144 160/116 192/96 
Blocksize / Number of Blocks 

Figure 7: The impact of the blocksize on the performance for test runs on single GPUs using 96 Chebyshev 
polynomials and 128 radial integration points. The axes label denotes the blocksize and the number of 
blocks, respectively (note that two threads are performing a single matrix entry as mentioned in Sec. |4]). 
One can see some slight deviations in the overall behavior of the different types of GPUs but in any case 
choosing the blocksize to be a multiple of a warp leads to optimal performance. Furthermore, these effects 
become noticeable only for rather large workloads. Thus, a blocksize equal to the number of Chebyshev 
polynomials (and equal to a multiple of a warp) is a convenient choice for almost all practical calculations. 



Let us finally comment on the calculation of the Schwinger function. The integral in 
eq. (20) is easily implemented on a GPU device since it is a generalized scalar product and 
can be treated using standard techniques The kernels are decomposed into a one- 

dimensional grid of thread-blocks for the Euclidean time steps where we used strides 
[39] to reduce the results in the end. However, the main improvements can be obtained 
from a parallelized generation of the nodes and weights. For the computation a (mapped) 
Gauss-Legendre quadrature rule is used. The following table shows performance results 
where the entries denote the run time in second^^ 



^■^Hcre, the speed-ups are measured between the GPUs and the Xeons and we used 2^^ Gauss-Legendre 
nodes. Also in this case an OpenMPI version is possible in principle. However, the usage of multiple GPUs 
is not reasonable due to the reduced complexity of this problem. 
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Q(i) 


Xe(l) 


GTX(l) 


Tesla(l) 


sequential 


213.5 


226.9 






CUDA^M 






4.1 


3.9 


speed-up 






55 


58 



Table 2: Performance results for an evaluation of the integral in eq. (20 1. 



6. Conclusions 

We performed a numerical analysis of the ghost-gluon Dyson-Schwinger equations (DSEs) 
of Yang-Mills theory. The truncated system of non-linear integral equations was solved 
with the help of a Chebyshev expansion for the dressing functions using subsequently a 
Newton-Raphson method to obtain a linear system. Here, the methods are ideally suited 
for an SIMD architecture as the problem decomposes into independent parts. The par- 
allelization of the system was performed using OpenMPI and CUDA"'"'^. By comparing 
the two parallelization strategies we demonstrated the computational advantage of GPUs 
for this problem. Compared to a sequential version we obtained speed-ups of approx- 
imately two orders of magnitude already with a single consumer GPU. The presented 
results demonstrate convincingly the benefits of modern GPU devices in DSE calculations, 
and the proposed solution strategy offers a helpful toolbox. 

Last but not least, the generalization to larger systems is straightforward since additional 
DSEs can be incorporated by extending the Newton matrix with the corresponding deriva- 
tives. In this respect we provided a basis for on-going and future computations which 
uses the Yang-Mills DSE system as input. Here, the GPU version does and is expected to 
perform successively better with increasing workload. 
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